PCA (using tidymodels) with wine dataset
PCA is a data reduction technique, to uncover latent variables that are uncorrelated. It is an unsupervised way of classification. Not all of the variables in high-dimensional data are required. Some are highly correlated with others and these variables may be omitted, while retaining a similar level of information in the dataset in terms of explaining the variance.
It is used as an exploratory data analysis tool, and may be used for feature engineering and/or clustering.
Always use only continuous variables, ensure that there are no missing data. Determine the number of components using eigenvalues, scree plots and parallel analysis.
The scores plot show the positions of the individual wine samples in the coordinate system of the PCs.
The loadings plot shows the contribution of the X variables to the PCs.
This dataset is from the kohonen package. It contains 177 rows and 13 columns.
These data are the results of chemical analyses of wines grown in the same region in Italy (Piedmont) but derived from three different cultivars: Nebbiolo, Barberas and Grignolino grapes. The wine from the Nebbiolo grape is called Barolo. The data contain the quantities of several constituents found in each of the three types of wines, as well as some spectroscopic variables.
The dataset requires some cleaning, and the type of wine was added to the datset.
data(wines)
wines <- as.data.frame(wines) %>%
janitor::clean_names() %>% # require data.frame
as_tibble() %>%
cbind(vintages) # vintages = Y outcome = category
glimpse(wines)
Rows: 177
Columns: 14
$ alcohol <dbl> 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 1…
$ malic_acid <dbl> 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
$ ash <dbl> 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
$ ash_alkalinity <dbl> 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
$ magnesium <dbl> 100, 101, 113, 118, 112, 96, 121, 97, 98, 1…
$ tot_phenols <dbl> 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
$ flavonoids <dbl> 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
$ non_flav_phenols <dbl> 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
$ proanth <dbl> 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
$ col_int <dbl> 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
$ col_hue <dbl> 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
$ od_ratio <dbl> 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
$ proline <dbl> 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
$ vintages <fct> Barolo, Barolo, Barolo, Barolo, Barolo, Bar…
Some exploratory data analysis was carried out:
skim(wines) # 177 x 13, all numeric + Y outcome
Name | wines |
Number of rows | 177 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
vintages | 0 | 1 | FALSE | 3 | Gri: 71, Bar: 58, Bar: 48 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
alcohol | 0 | 1 | 12.99 | 0.81 | 11.03 | 12.36 | 13.05 | 13.67 | 14.83 | ▂▇▇▇▃ |
malic_acid | 0 | 1 | 2.34 | 1.12 | 0.74 | 1.60 | 1.87 | 3.10 | 5.80 | ▇▅▂▂▁ |
ash | 0 | 1 | 2.37 | 0.28 | 1.36 | 2.21 | 2.36 | 2.56 | 3.23 | ▁▂▇▅▁ |
ash_alkalinity | 0 | 1 | 19.52 | 3.34 | 10.60 | 17.20 | 19.50 | 21.50 | 30.00 | ▁▆▇▃▁ |
magnesium | 0 | 1 | 99.59 | 14.17 | 70.00 | 88.00 | 98.00 | 107.00 | 162.00 | ▅▇▃▁▁ |
tot_phenols | 0 | 1 | 2.29 | 0.63 | 0.98 | 1.74 | 2.35 | 2.80 | 3.88 | ▅▇▇▇▁ |
flavonoids | 0 | 1 | 2.02 | 1.00 | 0.34 | 1.20 | 2.13 | 2.86 | 5.08 | ▆▆▇▂▁ |
non_flav_phenols | 0 | 1 | 0.36 | 0.12 | 0.13 | 0.27 | 0.34 | 0.44 | 0.66 | ▃▇▅▃▂ |
proanth | 0 | 1 | 1.59 | 0.57 | 0.41 | 1.25 | 1.55 | 1.95 | 3.58 | ▃▇▆▂▁ |
col_int | 0 | 1 | 5.05 | 2.32 | 1.28 | 3.21 | 4.68 | 6.20 | 13.00 | ▇▇▃▂▁ |
col_hue | 0 | 1 | 0.96 | 0.23 | 0.48 | 0.78 | 0.96 | 1.12 | 1.71 | ▅▇▇▃▁ |
od_ratio | 0 | 1 | 2.60 | 0.71 | 1.27 | 1.93 | 2.78 | 3.17 | 4.00 | ▆▃▆▇▂ |
proline | 0 | 1 | 745.10 | 314.88 | 278.00 | 500.00 | 672.00 | 985.00 | 1680.00 | ▇▇▅▃▁ |
wines %>%
select(-vintages) %>%
ggcorr(label = T, label_alpha = T, label_round = 2)
wines %>%
ggpairs(aes(col = vintages))
Is the dataset suitable for PCA analysis?
# Continuous Y
# No missing data
# Check assumptions for PCA #####
wines_no_y <- wines %>%
select(-vintages)
glimpse(wines_no_y)
Rows: 177
Columns: 13
$ alcohol <dbl> 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 1…
$ malic_acid <dbl> 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, 1…
$ ash <dbl> 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, 2…
$ ash_alkalinity <dbl> 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, 1…
$ magnesium <dbl> 100, 101, 113, 118, 112, 96, 121, 97, 98, 1…
$ tot_phenols <dbl> 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, 2…
$ flavonoids <dbl> 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2…
$ non_flav_phenols <dbl> 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0…
$ proanth <dbl> 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, 1…
$ col_int <dbl> 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, 5…
$ col_hue <dbl> 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, 1…
$ od_ratio <dbl> 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, 2…
$ proline <dbl> 1050, 1185, 1480, 735, 1450, 1290, 1295, 10…
# KMO: Indicates the proportion of variance in the variables that may be caused by underlying factors. High values (close to 1) indicate that factor analysis may be useful.
wines_no_y %>%
cor() %>%
KMO() # .70 above : YES
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = .)
Overall MSA = 0.78
MSA for each item =
alcohol malic_acid ash ash_alkalinity
0.73 0.80 0.44 0.68
magnesium tot_phenols flavonoids non_flav_phenols
0.67 0.87 0.81 0.82
proanth col_int col_hue od_ratio
0.85 0.62 0.79 0.86
proline
0.81
# Bartlett's test of sphericity: tests the hypothesis that the correlation matrix is an identity matrix (ie variables are unrelated and not suitable for structure detection.) For factor analysis, the p. value should be <0.05.
wines_no_y %>%
cor() %>%
cortest.bartlett(., n = 177) # p<0.05
$chisq
[1] 1306.787
$p.value
[1] 3.302319e-222
$df
[1] 78
With the use of update_role(), the types of wine information is retained in the dataset.
step_normalize() combines step_center() and step_scale()
Note that step_pca is the second step –> will need to retrieve the PCA results from the second list later.
wines_recipe <- recipe(~ ., data = wines) %>%
update_role(vintages, new_role = "id") %>%
# step_naomit(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors(), id = "pca")
wines_recipe # 13 predictors
Data Recipe
Inputs:
role #variables
id 1
predictor 13
Operations:
Centering and scaling for all_predictors()
No PCA components were extracted.
wines_prep <- prep(wines_recipe)
wines_prep # trained
Data Recipe
Inputs:
role #variables
id 1
predictor 13
Training data contained 177 data points and no missing data.
Operations:
Centering and scaling for alcohol, malic_acid, ... [trained]
PCA extraction with alcohol, malic_acid, ... [trained]
tidy_pca_loadings <- wines_prep %>%
tidy(id = "pca")
tidy_pca_loadings # values here are the loading
# A tibble: 169 x 4
terms value component id
<chr> <dbl> <chr> <chr>
1 alcohol -0.138 PC1 pca
2 malic_acid 0.246 PC1 pca
3 ash 0.00432 PC1 pca
4 ash_alkalinity 0.237 PC1 pca
5 magnesium -0.135 PC1 pca
6 tot_phenols -0.396 PC1 pca
7 flavonoids -0.424 PC1 pca
8 non_flav_phenols 0.299 PC1 pca
9 proanth -0.313 PC1 pca
10 col_int 0.0933 PC1 pca
# … with 159 more rows
wines_bake <- bake(wines_prep, wines)
wines_bake # has the PCA LOADING VECTORS that we are familiar with
# A tibble: 177 x 6
vintages PC1 PC2 PC3 PC4 PC5
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Barolo -2.22 -0.301 -2.03 0.281 -0.259
2 Barolo -2.52 1.06 0.974 -0.734 -0.198
3 Barolo -3.74 2.80 -0.180 -0.575 -0.257
4 Barolo -1.02 0.886 2.02 0.432 0.274
5 Barolo -3.04 2.16 -0.637 0.486 -0.630
6 Barolo -2.45 1.20 -0.985 0.00466 -1.03
7 Barolo -2.06 1.64 0.143 1.20 0.0105
8 Barolo -2.51 0.958 -1.78 -0.104 -0.871
9 Barolo -2.76 0.822 -0.986 -0.374 -0.437
10 Barolo -3.48 1.35 -0.428 -0.0399 -0.316
# … with 167 more rows
# a. Eigenvalues: Keep components greater than 1
# data is stored in penguins_prep, step 3
wines_prep$steps[[2]]$res$sdev # 3
[1] 2.1628220 1.5815708 1.2055413 0.9614802 0.9282978 0.8030241
[7] 0.7429548 0.5922321 0.5377546 0.4967984 0.4748054 0.4103374
[13] 0.3224124
# b. Scree plot/Variance plot
wines_prep %>%
tidy(id = "pca", type = "variance") %>%
filter(terms == "percent variance") %>%
ggplot(aes(x = component, y = value)) +
geom_point(size = 2) +
geom_line(size = 1) +
scale_x_continuous(breaks = 1:4) +
labs(title = "% Variance explained",
y = "% total variance",
x = "PC",
caption = "Source: ChemometricswithR book") +
geom_text(aes(label = round(value, 2)), vjust = -0.3, size = 4) +
theme_minimal() +
theme(axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")) # 2 or 3
# bii: Cumulative variance plot
wines_prep %>%
tidy(id = "pca", type = "variance") %>%
filter(terms == "cumulative percent variance") %>%
ggplot(aes(component, value)) +
geom_col(fill= "forestgreen") +
labs(x = "Principal Components",
y = "Cumulative variance explained (%)",
title = "Cumulative Variance explained") +
geom_text(aes(label = round(value, 2)), vjust = -0.2, size = 4) +
theme_minimal() +
theme(axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold"))
# c. Parallel analysis
fa.parallel(cor(wines_no_y),
n.obs = 333,
cor = "cor",
plot = T) # 3
Parallel analysis suggests that the number of factors = 4 and the number of components = 3
plot_loadings <- tidy_pca_loadings %>%
filter(component %in% c("PC1", "PC2", "PC3", "PC4")) %>%
mutate(terms = tidytext::reorder_within(terms,
abs(value),
component)) %>%
ggplot(aes(abs(value), terms, fill = value>0)) +
geom_col() +
facet_wrap( ~component, scales = "free_y") +
scale_y_reordered() + # appends ___ and then the facet at the end of each string
scale_fill_manual(values = c("deepskyblue4", "darkorange")) +
labs( x = "absolute value of contribution",
y = NULL,
fill = "Positive?",
title = "PCA Loadings Plot",
subtitle = "Number of PC should be 3, compare the pos and the neg",
caption = "Source: ChemometricswithR") +
theme_minimal()
plot_loadings
# PC1: flavonoids, tot_phenols, od_ratio, proanthocyanidins, col_hue, 36%
# PC2: col_int, alcohol, proline, ash, magnesium; 19.2%
# PC3: ash, ash_alkalinity, non_flav phenols; 11.2%
# PC4: malic acid?
An alternative way to plot:
# alternate plot loadings
learntidymodels::plot_top_loadings(wines_prep,
component_number <= 4, n = 5) +
scale_fill_manual(values = c("deepskyblue4", "darkorange")) +
theme_minimal()
# define arrow style
arrow_style <- arrow(angle = 30,
length = unit(0.2, "inches"),
type = "closed")
# get pca loadings into wider format
pca_loadings_wider <- tidy_pca_loadings%>%
pivot_wider(names_from = component, id_cols = terms)
pca_loadings_only <- pca_loadings_wider %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_segment(aes(xend = PC1, yend = PC2),
x = 0,
y = 0,
arrow = arrow_style) +
ggrepel::geom_text_repel(aes(x = PC1, y = PC2, label = terms),
hjust = 0,
vjust = 1,
size = 5,
color = "red") +
labs(title = "Loadings on PCs 1 and 2 for normalized data") +
theme_classic()
# Scores plot #####
# PCA SCORES are in bake
pc1pc2_scores_plot <- wines_bake %>%
ggplot(aes(PC1, PC2)) +
geom_point(aes(color = vintages, shape = vintages),
alpha = 0.8, size = 2) +
scale_color_manual(values = c("deepskyblue4", "darkorange", "purple")) +
labs(title = "Scores on PCs 1 and 2 for normalized data",
x = "PC1 (36%)",
y = "PC2 (19.2%)") +
theme_classic() +
theme(legend.position = "none")
grid.arrange(pc1pc2_scores_plot, pca_loadings_only, ncol = 2)
wines %>%
group_by(vintages) %>%
summarise(across(c(flavonoids, col_int, ash, malic_acid),
mean,
na.rm = T))
# A tibble: 3 x 5
vintages flavonoids col_int ash malic_acid
<fct> <dbl> <dbl> <dbl> <dbl>
1 Barbera 0.781 7.40 2.44 3.33
2 Barolo 2.98 5.53 2.46 2.02
3 Grignolino 2.08 3.09 2.24 1.93
PCA allows for exploratory characterizing of x variables that are associated with each other.
PC1: flavanoids, total phenols, OD_ratio. PC2: color intensity, alcohol, proline PC3: ash, ash_alkalinity PC4: malic acid (by right 3 components are sufficient)
Barbera, indicated in blue, has the largest score on PC 1 and PC2. Barolo, indicated in orange, has the smallest score on PC 1. Grignolo, indicated in purple, has the lowest score on PC 2.
Barbera has low flavonoids, high col_int and high malic acid Barolo has high flavonoids, medium col_int and intermediate malic acid Grignolino has intermediate flavonoids, high col_int and low malic acid.
For attribution, please cite this work as
lruolin (2021, Jan. 23). pRactice corner: PCA Wine. Retrieved from https://lruolin.github.io/myBlog/posts/20210123_PCA wine/
BibTeX citation
@misc{lruolin2021pca, author = {lruolin, }, title = {pRactice corner: PCA Wine}, url = {https://lruolin.github.io/myBlog/posts/20210123_PCA wine/}, year = {2021} }